/*
HEADER: CUG
TITLE: Parallel Pattern Matching and FGREP
VERSION: 1.00
DATE: 09/20/85
DESCRIPTION: "Development of algorithm used in FGREP, a full
emulation of Unix's 'fgrep' utility."
KEYWORDS: fgrep, grep, filter, UNIX, pattern-matching
SYSTEM:
FILENAME: FGREP.DOC
WARNINGS:
CRC: xxxx
SEE-ALSO: FGREP.C
AUTHORS: Ian Ashdown - byHeart Software
COMPILERS:
REFERENCES: AUTHORS: Bell Telephone Laboratories;
TITLE: UNIX Programmer's Manual Vol. 1, p. 166;
AUTHORS: A.V. Aho & M.J. Corasick;
TITLE: 'Efficient String Matching: An Aid to
Bibliographic Search'
Communications of the ACM
pp. 333 - 340, Vol. 18 No. 6 (June '75);
ENDREF
*/
/*-------------------------------------------------------------*/
Ian Ashdown
byHeart Software
1089 West 21st Street
North Vancouver
British Columbia V7P 2C6
Canada
Date: April 12th, 1985
Parallel Pattern Matching and FGREP
-----------------------------------
by Ian Ashdown
byHeart Software
*****************************************************************
*                                                               *
*  NOTE: This manuscript was published in the September 1985    *
*        issue of Dr. Dobb's Journal. It may be reproduced      *
*        for personal, non-commercial use only, provided that   *
*        the above copyright notice is included in all copies.  *
*        Copying for any other use without previously obtaining *
*        the written permission of the author is prohibited.    *
*                                                               *
*****************************************************************
Preparing an index to a large technical book. Finding all
references to a person in several years' worth of a company's
minutes of meetings. Searching for all occurrences of a specific
sequence of events in the volumes of data obtained from a
scientific experiment. All of these problems and many more are
examples of searching for patterns in a set of data. The question
is, what is the most efficient search algorithm to use?
For single patterns, such as searching for the phrase
"binary tree" in a text file, there are two of particular
interest: the Boyer-Moore Algorithm (reference 5) and the
Knuth-Morris-Pratt Algorithm (reference 6). Both are fast and
reasonably simple to implement. However, neither is appropriate
when more than one pattern at a time must be searched for.
One algorithm that is appropriate can be found in a paper
published in the June '75 issue of "Communications of the ACM"
(reference 1). Entitled "Efficient String Matching: An Aid to
Bibliographic Search" and written by Alfred Aho and Margaret
Corasick of Bell Laboratories, the paper presents "a simple and
efficient algorithm to locate all occurrences of any of a finite
number of keywords in a string of text". Without modification to
the algorithm, a "keyword" can be taken to mean any pattern, and
"a string of text" to mean any sequence of symbols, be it ASCII
text, digitized data obtained from a video camera, or whatever.
The algorithm has some interesting features. For
instance, most pattern-matching schemes must employ some form of
backtracking, rescanning part of the string whenever an attempted
match to a pattern fails or when more than one pattern is being
matched. The
Aho-Corasick algorithm differs in that it searches for all of the
patterns in parallel. By doing so, it can process the string in
one pass without any backtracking.
The algorithm also recognizes patterns that overlap in
the string. For example, if the patterns are "he", "she" and
"hers", the algorithm will correctly identify matches to all
three in the string "ushers".
As to speed of execution, it is independent of the number
of patterns to be matched! This perhaps surprising feature is a
direct result of the patterns being searched for in parallel.
Curiously, this algorithm has all but disappeared from
the literature of computer science. Of the hundreds of textbooks
and articles written since that discuss pattern matching, only a
few reference the paper, and none that I know of present the
Aho-Corasick algorithm itself. Unless you have access to back
issues of "Communications of the ACM", you will likely never
encounter it.
On the other hand, if you work with UNIX you may have
used a utility based on the algorithm: "fgrep". This useful tool
searches for and displays all occurrences of a set of keywords
and phrases in one or more text files. While it does not accept
wildcard characters in its patterns like UNIX's more flexible
"grep" utility, it does illustrate the speed of the Aho-Corasick
algorithm in comparison with other pattern matching methods.
Typically, "fgrep" is five to ten times faster than "grep" in
searching files for fixed string patterns.
Given its general usefulness, I thought it appropriate to
reintroduce Aho and Corasick's work in this article. At the same
time, it seemed a shame that "fgrep" has been restricted to the
UNIX operating system. Therefore, the source code in C for a full
implementation (reference 4) of "fgrep" has been included. This
serves not only to demonstrate the algorithm, but also to bring
the idea of a public domain UNIX one small step closer to
reality.
Inside The Machine
------------------
The Aho-Corasick algorithm consists of two phases:
building a "finite state automaton" (FSA for short), then running
a string of data through it, with each consecutive symbol being
considered a separate input. Before analyzing these phases, a
quick summary of what finite state automata are and how they work
is in order. (For a more detailed discussion, reference 2 is
recommended.)
In essence, an FSA is a conceptual machine that can be in
any one of a finite number of "states". It also has a set of
"state transition" rules and a set of "inputs", which together
with the set of states define what inputs cause the machine to
change from one state to another. Finally, since a machine serves
no purpose without producing some output, an FSA has one or more
"terminal" states. Output is produced only when the FSA enters
such states.
A physical example of an FSA could be a light switch.
This device has two states, ON and OFF. Its inputs consist of
someone pushing the switch UP or DOWN. ON is a terminal state,
where the associated lamp produces visible radiation. The state
transition rules can be stated in a simple truth table:
+------------------+
|    | OFF  |  ON  |
|----|------|------|
| UP |  ON  |  --  |
|DOWN|  --  | OFF  |
+------------------+
Alternatively, this could be shown as a "state transition
diagram":
            DOWN
       +-------------+
       |             |
       V             |
    +-----+  UP   ******
+-->| OFF |------>* ON *<---+
|   +-----+       ******    |
|      |             |      |
| DOWN |             | UP   |
+------+             +------+
where no state transition on a particular input (as shown in the
truth table) is equivalent to a transition from a state to itself
(as shown in the diagram).
Getting ahead of ourselves for a moment, look at Figure
1a, which shows part of an FSA (the other parts are shown in
Figures 1b and 1c, and will be explained shortly). By having
sequences of states, it is possible for the FSA to recognize
patterns in an input stream. For example, presenting the string
"she" to the FSA shown in Figure 1a would cause it to change from
State 0 to States 3, 4 and 5 as each input symbol of the string
is processed. State 5 is a terminal state that indicates the
pattern "she" was recognized.
There are two types of FSA that can be built, one
"nondeterministic" (Algorithm 1) and the other "deterministic"
(Algorithm 2). The nondeterministic one (NFSA for short) uses
functions called "go_to", "failure" and "output", while the
deterministic FSA (DFSA) uses "move" and "output". The difference
between the two is that while the NFSA can make one or more state
transitions per input symbol, the DFSA makes only one. With fewer
transitions to make, a DFSA can process its input in less time
than an equivalent NFSA.
The "go_to" function accepts as input parameters the
current state of th